Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition
We present a unified framework for understanding human social behaviors in
raw image sequences. Our model jointly detects multiple individuals, infers
their social actions, and estimates the collective actions with a single
feed-forward pass through a neural network. We propose a single architecture
that does not rely on external detection algorithms but rather is trained
end-to-end to generate dense proposal maps that are refined via a novel
inference scheme. The temporal consistency is handled via a person-level
matching Recurrent Neural Network. The complete model takes as input a sequence
of frames and outputs detections along with the estimates of individual actions
and collective activities. We demonstrate state-of-the-art performance of our
algorithm on multiple publicly available benchmarks.
Variational Methods for Human Modeling
A large part of computer vision research is devoted to building models
and algorithms aimed at understanding human appearance and behaviour
from images and videos. Ultimately, we want to build automated systems
that are at least as capable as people when it comes to
interpreting humans. Most of the tasks that we want these systems to
solve can be posed as a problem of inference in probabilistic
models. Although probabilistic inference is, in general, a very hard problem in its own right, there exists a powerful class of inference
algorithms, variational inference, which allows us to build efficient
solutions for a wide range of problems.
In this thesis, we consider a variety of computer vision problems
targeted at modeling human appearance and behaviour, including
detection, activity recognition, semantic segmentation and facial
geometry modeling. For each of those problems, we develop novel methods
that use variational inference to improve the capabilities
of the existing systems.
First, we introduce a novel method for detecting multiple potentially
occluded people in depth images, which we call DPOM. Unlike many other
approaches, our method performs probabilistic reasoning jointly, and can thus propagate knowledge about one part of the image evidence when reasoning about the rest. This is particularly
important in crowded scenes involving many people, since it helps to
handle ambiguous situations resulting from severe occlusions. We
demonstrate that our approach outperforms existing methods on multiple
datasets.
Second, we develop a new algorithm for variational inference that
works for a large class of probabilistic models, which includes, among
others, DPOM and some of the state-of-the-art models for semantic
segmentation. We provide a formal proof that our method converges,
and demonstrate experimentally that it brings better performance than
the state-of-the-art on several real-world tasks, which include
semantic segmentation and people detection. Importantly, we show that
parallel variational inference in discrete random fields can be seen
as a special case of proximal gradient descent, which allows us to
benefit from many of the advances in gradient-based optimization.
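As a generic illustration of the parallel scheme this result concerns, and not the thesis's algorithm, the sketch below runs simultaneous mean-field updates over all nodes of a small discrete random field (a chain with a shared pairwise log-compatibility matrix); all names and shapes are assumptions made for illustration.

```python
# Parallel mean-field inference in a discrete chain MRF: every node's approximate
# marginal is refreshed simultaneously from its neighbours' current marginals.
# This is the standard textbook scheme, shown only to make the setting concrete.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def parallel_mean_field(unary, pairwise, n_iters=20):
    """unary: (n_nodes, n_labels) log-unary potentials; pairwise: (n_labels, n_labels)
    shared log-compatibility between chain neighbours. Returns approximate marginals."""
    q = softmax(unary, axis=-1)
    for _ in range(n_iters):
        msg = np.zeros_like(unary)
        msg[1:]  += q[:-1] @ pairwise        # expected compatibility from left neighbour
        msg[:-1] += q[1:]  @ pairwise.T      # expected compatibility from right neighbour
        q = softmax(unary + msg, axis=-1)    # simultaneous ("parallel") update of all nodes
    return q

# Toy usage: 10 nodes, 3 labels, Potts-like smoothness on the log scale.
rng = np.random.default_rng(0)
unary = rng.normal(size=(10, 3))
pairwise = 0.5 * np.eye(3)
marginals = parallel_mean_field(unary, pairwise)   # shape (10, 3)
```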
Third, we propose a unified framework for multi-human scene
understanding which simultaneously solves three tasks: multi-person
detection, individual action recognition and collective activity
recognition. Within our framework, we introduce a novel multi-person
detection scheme, which relies on variational inference and
jointly refines detection hypotheses instead of relying on
suboptimal post-processing. Ultimately, our model takes as input a
frame sequence and produces a comprehensive description of the
scene. Finally, we experimentally demonstrate that our method brings
better performance than the state-of-the-art.
Fourth, we propose a new approach for learning facial geometry with
deep probabilistic models and variational methods. Our model is based
on a variational autoencoder with multiple sets of hidden variables,
which capture various levels of deformation, ranging from
global to local, high-frequency ones. We experimentally demonstrate
the power of the model on a variety of fitting tasks. Our model is
completely data-driven and can be learned from a relatively small
number of individuals.
NPC: Neural Point Characters from Video
High-fidelity human 3D models can now be learned directly from videos,
typically by combining a template-based surface model with neural
representations. However, obtaining a template surface requires expensive
multi-view capture systems, laser scans, or strictly controlled conditions.
Previous methods avoid using a template but rely on a costly or ill-posed
mapping from observation to canonical space. We propose a hybrid point-based
representation for reconstructing animatable characters that does not require
an explicit surface model, while being generalizable to novel poses. For a
given video, our method automatically produces an explicit set of 3D points
representing approximate canonical geometry, and learns an articulated
deformation model that produces pose-dependent point transformations. The
points serve both as a scaffold for high-frequency neural features and an
anchor for efficiently mapping between observation and canonical space. We
demonstrate on established benchmarks that our representation overcomes
limitations of prior work operating in either canonical or observation
space. Moreover, our automatic point extraction approach enables learning
models of human and animal characters alike, matching the performance of the
methods using rigged surface templates despite being more general. Project
website: https://lemonatsu.github.io/npc/
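As a loose illustration of pose-dependent point transformations (NPC learns this mapping; the plain linear blending below is only a stand-in), the sketch poses a canonical point scaffold by blending rigid bone transforms with per-point weights. All names, shapes, and the blending scheme are assumptions made for illustration.

```python
# Move a canonical set of 3D points into observation space by blending per-bone
# rigid transforms with fixed per-point weights. This is a simplified stand-in
# for a learned articulated deformation model, not the paper's method.
import numpy as np

def blend_points(points_c, weights, bone_transforms):
    """points_c: (N,3) canonical points; weights: (N,B) per-point bone weights;
    bone_transforms: (B,4,4) rigid transforms -> (N,3) posed points."""
    points_h = np.concatenate([points_c, np.ones((len(points_c), 1))], axis=1)  # (N,4)
    per_bone = np.einsum('bij,nj->nbi', bone_transforms, points_h)              # (N,B,4)
    posed_h = np.einsum('nb,nbi->ni', weights, per_bone)                        # (N,4)
    return posed_h[:, :3]

rng = np.random.default_rng(0)
n_points, n_bones = 200, 4
points_c = rng.uniform(-1, 1, size=(n_points, 3))      # canonical point scaffold
weights = rng.random((n_points, n_bones))
weights /= weights.sum(axis=1, keepdims=True)           # normalized blend weights
bones = np.tile(np.eye(4), (n_bones, 1, 1))             # identity pose as a placeholder
posed = blend_points(points_c, weights, bones)           # (200, 3) posed points
```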
Masksembles for Uncertainty Estimation
Deep neural networks have amply demonstrated their prowess but estimating the
reliability of their predictions remains challenging. Deep Ensembles are widely
considered as being one of the best methods for generating uncertainty
estimates but are very expensive to train and evaluate. MC-Dropout is another
popular alternative, which is less expensive, but also less reliable. Our
central intuition is that there is a continuous spectrum of ensemble-like
models of which MC-Dropout and Deep Ensembles are extreme examples. The first
uses an effectively infinite number of highly correlated models while the
second relies on a finite number of independent models.
To combine the benefits of both, we introduce Masksembles. Instead of
randomly dropping parts of the network as in MC-Dropout, Masksembles relies on a fixed number of binary masks, which are parameterized in a way that allows the correlations between individual models to be changed. Namely, by controlling the
overlap between the masks and their density one can choose the optimal
configuration for the task at hand. This leads to a simple, easy-to-implement method with performance on par with Deep Ensembles at a fraction of the
cost. We experimentally validate Masksembles on two widely used datasets,
CIFAR10 and ImageNet.
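As a rough sketch of the idea, and not the authors' implementation, the snippet below builds a fixed set of binary masks of a chosen density, runs the same linear layer under each mask, and uses the spread of the member outputs as an uncertainty signal. The mask-generation heuristic, names, and shapes are assumptions made for illustration; the actual method controls the overlap between masks explicitly rather than leaving it to random sampling.

```python
# Masksembles-style ensembling sketch: a fixed set of binary masks is created once
# and reused, so each "ensemble member" corresponds to one mask. Mask density (and,
# in the real method, overlap) controls how correlated the members are.
import numpy as np

def make_masks(n_masks, n_features, density, seed=None):
    """Sample a fixed set of binary masks, each keeping `density` of the features."""
    rng = np.random.default_rng(seed)
    n_keep = int(round(density * n_features))
    masks = np.zeros((n_masks, n_features), dtype=np.float32)
    for i in range(n_masks):
        keep = rng.choice(n_features, size=n_keep, replace=False)
        masks[i, keep] = 1.0
    return masks

def masksembles_forward(x, masks, weights):
    """Run one linear layer under every mask and return per-member outputs.
    x: (batch, n_features); masks: (n_masks, n_features); weights: (n_features, n_out)."""
    outputs = [(x * m) @ weights for m in masks]   # each member sees a fixed feature subset
    return np.stack(outputs, axis=0)               # (n_masks, batch, n_out)

# Prediction = mean over members; member disagreement serves as an uncertainty estimate.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)).astype(np.float32)
w = rng.normal(size=(16, 3)).astype(np.float32)
masks = make_masks(n_masks=4, n_features=16, density=0.5, seed=0)
member_out = masksembles_forward(x, masks, w)
mean_pred, uncertainty = member_out.mean(axis=0), member_out.var(axis=0)
```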
Drivable 3D Gaussian Avatars
We present Drivable 3D Gaussian Avatars (D3GA), the first 3D controllable
model for human bodies rendered with Gaussian splats. Current photorealistic
drivable avatars require either accurate 3D registrations during training,
dense input images during testing, or both. The ones based on neural radiance
fields also tend to be prohibitively slow for telepresence applications. This
work uses the recently presented 3D Gaussian Splatting (3DGS) technique to
render realistic humans at real-time framerates, using dense calibrated
multi-view videos as input. To deform those primitives, we depart from the
commonly used point deformation method of linear blend skinning (LBS) and use a
classic volumetric deformation method: cage deformations. Given their smaller
size, we drive these deformations with joint angles and keypoints, which are
more suitable for communication applications. Our experiments on nine subjects
with varied body shapes, clothes, and motions obtain higher-quality results
than state-of-the-art methods when using the same training and test data. Website: https://zielon.github.io/d3ga
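As a rough, generic illustration of cage-based deformation, and not D3GA's implementation, the snippet below binds points to a coarse cage with simple inverse-distance weights (standing in for proper generalized barycentric coordinates) and moves them by recombining the displaced cage vertices. All names and shapes are assumptions made for illustration.

```python
# Cage deformation sketch: each point is expressed as a fixed weighted combination
# of cage vertices computed in the rest pose, so displacing the cage displaces the
# points. Inverse-distance weights are a simplification used only for illustration.
import numpy as np

def cage_weights(points, cage_rest):
    """points: (N,3); cage_rest: (C,3) rest-pose cage vertices -> (N,C) weights."""
    d = np.linalg.norm(points[:, None, :] - cage_rest[None, :, :], axis=-1) + 1e-8
    w = 1.0 / d
    return w / w.sum(axis=1, keepdims=True)

def deform(points, cage_rest, cage_deformed):
    """Move every point with the cage; weights are computed once in the rest pose."""
    w = cage_weights(points, cage_rest)
    return w @ cage_deformed                 # (N,3) deformed point positions

rng = np.random.default_rng(0)
gaussian_centres = rng.uniform(-1, 1, size=(500, 3))    # primitive centres at rest
cage_rest = rng.uniform(-1.2, 1.2, size=(8, 3))         # a coarse driving cage
cage_posed = cage_rest + rng.normal(scale=0.05, size=cage_rest.shape)  # driven cage
posed_centres = deform(gaussian_centres, cage_rest, cage_posed)
```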
Modeling Facial Geometry using Compositional VAEs
We propose a method for learning non-linear face geometry representations using deep generative models. Our model is a variational autoencoder with multiple levels of hidden variables, where lower layers capture global geometry and higher ones encode more local deformations. Based on that, we propose a new parameterization of facial geometry that naturally decomposes the structure of the human face into a set of semantically meaningful levels of detail. This parameterization enables us to perform model fitting while capturing varying levels of detail under different types of geometrical constraints.
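A minimal sketch of the compositional decoding idea, not the paper's architecture: the mesh is composed from a mean face plus one deformation field per latent level, so a coarse fit can use the global code alone while finer levels add local, high-frequency detail. The linear decoders, shapes, and names below are assumptions made for illustration.

```python
# Compose a face mesh from several latent levels, coarse/global first, then
# increasingly local ones. Plain linear decoders stand in for the learned networks.
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 1000                        # vertices in the face mesh
latent_dims = [8, 32, 128]               # global -> mid -> local latent sizes
mean_face = rng.normal(size=(n_vertices, 3))

# One hypothetical linear decoder per level, mapping its latent to per-vertex offsets.
decoders = [rng.normal(scale=0.01, size=(d, n_vertices * 3)) for d in latent_dims]

def decode(latents):
    """Compose the mesh from the mean face plus one deformation field per level."""
    mesh = mean_face.copy()
    for z, W in zip(latents, decoders):
        mesh += (z @ W).reshape(n_vertices, 3)
    return mesh

# Fitting under constraints would optimize only the latents needed for the desired
# level of detail, e.g. the global code alone for a rough fit.
latents = [rng.normal(size=d) for d in latent_dims]
fitted_mesh = decode(latents)
```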
Dressing Avatars: Deep Photorealistic Appearance for Physically Simulated Clothing
Despite recent progress in developing animatable full-body avatars, realistic
modeling of clothing - one of the core aspects of human self-expression -
remains an open challenge. State-of-the-art physical simulation methods can
generate realistically behaving clothing geometry at interactive rates.
Modeling photorealistic appearance, however, usually requires physically-based
rendering which is too expensive for interactive applications. On the other
hand, data-driven deep appearance models are capable of efficiently producing
realistic appearance, but struggle at synthesizing geometry of highly dynamic
clothing and handling challenging body-clothing configurations. To this end, we
introduce pose-driven avatars with explicit modeling of clothing that exhibit
both photorealistic appearance learned from real-world data and realistic
clothing dynamics. The key idea is to introduce a neural clothing appearance
model that operates on top of explicit geometry: at training time we use
high-fidelity tracking, whereas at animation time we rely on physically
simulated geometry. Our core contribution is a physically-inspired appearance
network, capable of generating photorealistic appearance with view-dependent
and dynamic shadowing effects even for unseen body-clothing configurations. We
conduct a thorough evaluation of our model and demonstrate diverse animation
results on several subjects and different types of clothing. Unlike previous
work on photorealistic full-body avatars, our approach can produce much richer
dynamics and more realistic deformations even for many examples of loose
clothing. We also demonstrate that our formulation naturally allows clothing to
be used with avatars of different people while staying fully animatable, thus
enabling, for the first time, photorealistic avatars with novel clothing. SIGGRAPH Asia 2022 (ACM ToG) camera ready. Supplementary video: https://research.facebook.com/publications/dressing-avatars-deep-photorealistic-appearance-for-physically-simulated-clothing